[PATCH] BUG FIX: redo will abort, due to inconsistent page found in BRIN_REGULAR_PAGE

    wanghaiyang.001@bytedance.com 2022-07-28T08:10:44+00:00
    Hi hackers,

    I found that setting wal_consistency_checking = brin can cause redo to
    abort: all standby nodes are lost, and the primary node cannot be
    restarted. This bug exists in all PostgreSQL versions that support
    wal_consistency_checking.

    The steps to reproduce are as follows:

    1. Create a primary instance, set wal_consistency_checking = brin, and
       start the primary instance.

        initdb -D pg_test
        echo "wal_consistency_checking = brin" >> pg_test/postgresql.conf
        echo "port=53320" >> pg_test/postgresql.conf
        pg_ctl start -D pg_test -l pg_test.logfile

    2. Create a standby instance.

        pg_basebackup -R -p 53320 -D pg_test_slave
        echo "wal_consistency_checking = brin" >> pg_test_slave/postgresql.conf
        echo "port=53321" >> pg_test_slave/postgresql.conf
        pg_ctl start -D pg_test_slave -l pg_test_slave.logfile

    3. Execute brin_redo_abort.sql through psql; the standby is then lost.

        psql -p 53320 -f brin_redo_abort.sql

    4. The standby instance is lost during redo, with the following FATAL
       message:

        FATAL: inconsistent page found, rel 1663/12978/16387, forknum 0, blkno 2

    5. The primary instance cannot be restarted with pg_ctl restart -mi.

        pg_ctl restart -D pg_test -mi -l pg_test.logfile

    6. Restarting the primary instance fails with the following FATAL
       message:

        FATAL: inconsistent page found, rel 1663/12978/16387, forknum 0, blkno 2

    My analysis of the cause is as follows:

    1. When the revmap needs to be extended by brinRevmapExtend, we may set
       the BRIN_EVACUATE_PAGE flag on a REGULAR_PAGE in
       brin_start_evacuating_page, to prevent other concurrent backends
       from adding more BrinTuples to that page.

    2. During redo, however, there is no need to set the BRIN_EVACUATE_PAGE
       flag on that REGULAR_PAGE after removing the old BrinTuple in
       brin_xlog_update, since no one will add a BrinTuple to that page at
       this time. The replayed page therefore does not carry the flag.

    3. As a result, CheckXLogConsistency throws a FATAL message after redo
       because the BRIN_EVACUATE_PAGE flag differs between the two pages it
       compares, and redo aborts.

    4. Therefore, the BRIN_EVACUATE_PAGE flag should be cleared before
       CheckXLogConsistency.

    The patch file, the SQL file, the shell script, and the log files are
    attached.

    Best Regards!
    Haiyang Wang
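The attached patch is not reproduced in this message. As a rough illustration of point 4 only, the idea of ignoring the flag before two page images are compared could be sketched in standalone C as below. Everything here is a hypothetical stand-in: ToyPage, TOY_EVACUATE_FLAG, toy_mask_page and toy_pages_consistent are invented for the example and are not PostgreSQL structures, values, or functions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
#define TOY_PAGE_DATA_SIZE 16
#define TOY_EVACUATE_FLAG  ((uint16_t) 0x0001)

typedef struct ToyPage
{
    uint16_t flags;                     /* page-level flag bits */
    uint8_t  data[TOY_PAGE_DATA_SIZE];  /* page contents */
} ToyPage;

/* Clear the bit that is set only on the primary and never replayed. */
void
toy_mask_page(ToyPage *page)
{
    page->flags &= (uint16_t) ~TOY_EVACUATE_FLAG;
}

/*
 * Compare a page image taken on the primary with the page produced by
 * replay.  Both copies are masked first, so a difference in the
 * evacuate bit alone is not reported as an inconsistency.
 * Returns 1 if the pages match after masking, 0 otherwise.
 */
int
toy_pages_consistent(const ToyPage *primary_image, const ToyPage *replayed)
{
    ToyPage a = *primary_image;   /* work on copies; inputs stay untouched */
    ToyPage b = *replayed;

    toy_mask_page(&a);
    toy_mask_page(&b);

    return a.flags == b.flags &&
           memcmp(a.data, b.data, TOY_PAGE_DATA_SIZE) == 0;
}
```

With this sketch, a primary image that has the flag set and a replayed page that does not are treated as consistent, while a real difference in the page data is still detected. Whether the flag is better cleared in the masking step or elsewhere is a design decision for the actual patch.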